ABSTRACT

While the introduction of differential privacy has been a major 
breakthrough in the study of privacy preserving data publication, 
some recent work has pointed out a number of cases where it is 
not possible to limit inference about individuals. The dilemma that 
is intrinsic in the problem is the simultaneous requirement of data 
utility in the published data. Differential privacy does not aim to 
protect information about an individual that can be uncovered even 
without the participation of the individual. However, this lack of 
coverage may violate the principle of individual privacy. Here we 
propose a solution by providing protection to sensitive information, 
by which we refer to the answers for aggregate queries with small 
counts. Previous works based on ℓ-diversity can be seen as providing
a special form of this kind of protection. Our method is
developed with another goal which is to provide differential pri- 
vacy guarantee, and for that we introduce a more refined form of 
differential privacy to deal with certain practical issues. Our empir- 
ical studies show that our method can preserve better utilities than a 
number of state-of-the-art methods although these methods do not 
provide the protections that we provide. 

1. INTRODUCTION 

The ultimate source of the problem with privacy preserving data 
publishing is that we must also consider the utility of the published 
data. The problem is intriguing to begin with because we have a 
pair of seemingly contradictory goals of utility and privacy. When- 
ever we are able to provide some useful information with the pub- 
lished data, there is the question of privacy breach because of that 
information. 

The statistician Tore Dalenius advocated the following privacy
goal in [8]: Anything that can be learned about a respondent from
the statistical database should be learnable without access to the
database. To aim for this goal, some previous works have considered
the approach where the prior and posterior beliefs about an individual
are to be similar [18] [27] [4]. As discussed in [13], this privacy
goal may be self-contradictory and impossible in the case of privacy
preserving data publication. Since the goal of the published data is
for a receiver to learn something about the population, the receiver
can by definition discover something about an individual in the
population, and the receiver could happen to be the adversary.
Due to the seeming impossibility of the above goal, research
in differential privacy moves away from protecting the information 
about a row in the data table that can be learned from other rows 1 5l. 
The argument is that such information is derivable without the par- 
ticipation of the corresponding individual in the dataset, and hence 
is not under the control of the individual. However, although not 
under the individual's control, such information could nevertheless 
be sensitive. 

An important goal of our work here is to show that it is possible 
to protect sensitive information that can be acquired from the pub- 
lished dataset provided that the data publisher, with control over 
the dataset, can act on behalf of each individual. The principle of 
protecting individual privacy may dictate that the publisher either 
provides this kind of protection or does not publish the data. It 
is desirable that on top of ensuring that the participation of a user 
makes little difference to the results of data analysis, the publisher 
also guarantees protection for sensitive information that can be de- 
rived from the published dataset, with or without the data of the 
individual involved. Such a solution is our goal. While Dalenius's 
original goal may be impossible, it is also an overkill. A "relaxed" 
goal suffices: Anything "sensitive" that can be learned about a re- 
spondent from the statistical database should be learnable without 
access to the database. The obvious question is what should be 
considered sensitive. We provide a plausible answer here. 

Let us consider an example given in [14], where a dataset D' tells
us that almost everyone involved in a dataset has one left foot and 
one right foot. We would agree that knowing with high certainty 
that a respondent is two footed from D' is not considered a prob- 
lem since almost everyone is two footed. Note that even if an indi- 
vidual does not participate in the data collection, the deduction can 
still be made based on a simple assumption of an i.i.d. data genera- 
tion. Differential privacy and all of the proposed privacy models so 
far do not exclude the possibilities of deriving information of such 
form. In fact, by definition of data utility, such a derivation should 
be supported. This example is not alarming since it involves a large 
population. However, there will be cases where the information becomes
sensitive and requires protection. Let us consider a medical
data set. Suppose lung cancer is not a common disease. Also suppose
there are only five females aged 70 with postal code 2980,
and all of them have lung cancer. The linkage of the corresponding
(gender, age, postal code) with lung cancer in this case is 100%; if
we maintain high utility for accurately extracting such information
or concepts, the privacy of the five females will be compromised.
This is alarming because accurate answers to
queries of small counts can disclose highly sensitive information.
A problem with many existing techniques lies in non-discriminative
utilities for all concepts. We propose to consider discriminative
utilities which are based on the population sizes: queries involving
large populations can be answered relatively accurately while
queries with a very small population base should not. A similar
idea is found in the literature of security for statistical databases [X, 
[30], [20], [21], [9] (see Section 10 on related work).

Protecting queries of small counts is implicit in many previous
works. For example, the principle of ℓ-diversity [25] essentially
protects against accurate answers to queries about the sensitive values
of individuals, which may become small count queries given
that the adversary has knowledge about the non-sensitive values of
an individual and is therefore capable of a linkage attack [29], [28].
We shall show that our mechanism provides better protection when
compared to ℓ-diversity approaches.

Our major contributions are summarized as follows. We point 
out the dilemma that utility is a source of privacy breach, so that 
on top of differential privacy we must also protect sensitive infor- 
mation that can be derived from the published data. We propose 
a mechanism for privacy preserving data publication which pro- 
vides three lines of protection: (1) differential privacy to protect 
information that may be attained from the data of an individual 
tuple, (2) protection for concepts with small counts which can be 
derived from the entire published data set, and (3) a guarantee that 
the published data does not narrow down the set of possible sensitive
values for each individual. We enforce a stronger ε-differential
privacy guarantee by setting ε = 0. We support discriminative utilities
so that concepts with large counts can be preserved. While ℓ-diversity
methods are vulnerable to adversary knowledge that eliminates
ℓ - 1 possible values, our method is resilient to such attacks.
We have conducted experiments on a real dataset to show that our 
method provides better utilities for the large sum queries than sev- 
eral state of the art methods which do not have the above guarantee. 

The rest of the paper is organized as follows. In Section 2, we revisit
ε-differential privacy for non-interactive database sanitization.
We point out issues about ε and about known presence. Then we
introduce our model of ℓ'-diverted zero-differential privacy. Section 3
describes a first attempt at a solution using an existing randomization
method; we show that this method cannot guarantee
zero-differential privacy. Section 4 describes our proposed mechanism
$A'$, which generates $D'$. Section 5 is about count estimation
given D' . Section 6 shows that mechanism A' supports high utility 
for large counts and high inaccuracies for small counts. Section 7 
is about multiple attribute aggregations. Section 8 is a discussion 
about auxiliary knowledge that may be possessed by the adversary. 
Section 9 reports on the empirical study. Related works are sum- 
marized in Section 10 and we conclude in Section 11. 

2. ℓ'-DIVERTED PRIVACY

Our proposed method guarantees a desired form of differential 
privacy with the additional protection against the disclosure of sen- 
sitive information of small counts. In this section we shall introduce 
our definition of privacy guarantee based on differential privacy. 
First we examine some relevant definitions from previous works. 
The following is taken from [6].

Definition 1 ($A(D)$ and ε-differential privacy).
For a database $D$, let $A$ be a database sanitization mechanism;
we say that $A(D)$ induces a distribution over outputs.
We say that mechanism $A$ satisfies ε-differential privacy if
for all neighboring databases $D_1$ and $D_2$ (i.e., $D_1$ and $D_2$
differ in at most one tuple), and for all sanitized outputs $\hat{D}$,

$$\Pr[A(D_1) = \hat{D}] \le e^{\varepsilon} \cdot \Pr[A(D_2) = \hat{D}] \qquad \Box$$

The above definition says that for any two neighboring 
databases, the probabilities that A generates any particular dataset 
for publication are very similar. However, there are some practical 
problems with this definition. 

2.1 The problem with ε

In ε-differential privacy, the parameter ε is public. The sanitized
data is released to the public, and the public refers to a wide spectrum
of users and applications. It is not at all clear how we may
have the parameter ε decided once and for all. In [14], it is suggested
that we tend to think of ε as, say, 0.01, 0.1, or in some cases,
ln 2 or ln 3. Evidently the value can vary a lot. For example, for
the above suggested values, $e^{\varepsilon}$ ranges from 1.01, 1.105 to 2 and 3.

A second problem with the setting of ε is that it may compromise
privacy. Suppose that for all pairs of neighboring datasets $D_1$
and $D_2$, where $D_2$ contains $t$ while $D_1$ does not, $\Pr[A(D_1) = \hat{D}]$
is 1/3 for some sanitized output $\hat{D}$, while $\Pr[A(D_2) = \hat{D}]$ is 1.
If we set ε to ln 3, then ε-differential privacy is satisfied, but the
existence of $t$ can be estimated with 75% confidence.
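
The two observations above can be checked numerically. The following is a minimal sketch, not part of the original method: the function names are hypothetical, and the 75% figure is reproduced under the assumption that the adversary considers presence and absence of $t$ equally likely a priori.

```python
import math

def presence_posterior(pr_without, pr_with, prior_with=0.5):
    """Posterior probability that tuple t is present, given the observed output.

    pr_without: Pr[A(D1) = output] where D1 does not contain t
    pr_with:    Pr[A(D2) = output] where D2 contains t
    prior_with: assumed prior belief that t is present (an assumption here)
    """
    joint_with = prior_with * pr_with
    joint_without = (1.0 - prior_with) * pr_without
    return joint_with / (joint_with + joint_without)

def satisfies_epsilon_dp(pr_a, pr_b, epsilon, tol=1e-12):
    """Check the epsilon-differential-privacy inequality in both directions
    for a single output, with a small tolerance for rounding."""
    bound = math.exp(epsilon)
    return pr_a <= bound * pr_b + tol and pr_b <= bound * pr_a + tol

# Multiplicative factors e^epsilon for the suggested values of epsilon.
multipliers = {
    "0.01": math.exp(0.01),
    "0.1": math.exp(0.1),
    "ln 2": math.exp(math.log(2)),
    "ln 3": math.exp(math.log(3)),
}
```

With `pr_without = 1/3` and `pr_with = 1`, the inequality holds for ε = ln 3 while the posterior for the presence of $t$ is 0.75.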

The above concerns call for the elimination of ε. We can do
so by setting ε to zero. This is in fact the best guarantee since it
means that there is no difference between $D_1$ and $D_2$ in terms of
the probability of generating $D'$. We shall refer to this guarantee as
zero-differential privacy.

2.2 The issue of known presence 

While the initial definition of differential privacy aims at hiding
the presence or absence of an individual's data record, it is often
the case that the presence is already known. As discussed in [14],
in such cases, rather than hiding the presence, we wish to hide certain
values in an individual's row. We shall refer to such values that
need to be hidden as the sensitive values. The definition of differential
privacy needs to be adjusted accordingly. The phrase "differ in
at most one tuple" in Definition 1 can be converted to "have a symmetric
difference of at most 2". This is so that in two datasets $D_1$
and $D_2$, if only the data for one individual is different, then we shall
find two tuples in the symmetric difference of $D_1$ and $D_2$. The two
tuples are tuples of the same individual in the two datasets, but the
sensitive values differ. However, with this definition, the counts for
sensitive values in $D_1$ or $D_2$ would deviate from the original data
set $D$. For a neighboring database, we prefer to preserve as much
as possible the characteristics in $D$. In the following subsection,
we introduce a definition of differential privacy that addresses the
above problems.

2.3 ℓ'-diverted zero-differential privacy

Given a dataset (table) $D$ which is a set of tuples, the problem
is how to generate sensitive values for the tuples in $D$ to be published
in the output dataset $D'$. We assume that there are two kinds
of attributes in the dataset, the non-sensitive attributes (NSA) and a
sensitive attribute (SA) $S$. Let the domain of $S$ be $domain(S) =
\{s_1, \ldots, s_m\}$. We do not perturb the non-sensitive values but we
may alter the sensitive values in the tuples to ensure privacy. We
first introduce our definition of neighboring databases, which preserves
the counts of sensitive values and minimizes the moves
by swapping the sensitive values of exactly one arbitrary pair of tuples
with different sensitive values. In the following we use $t.s$ to
denote the value of the sensitive attribute of tuple $t$.

Definition 2 (neighbor w.r.t. $t$). Suppose we have two
databases $D_1$ and $D_2$ containing tuples for the same set of individuals,
and $D_1$ and $D_2$ differ only at two pairs of tuples, $t, \hat{t}$ in $D_1$
and $t', \hat{t}'$ in $D_2$. Tuples $t$ and $t'$ are for the same individual, and $\hat{t}$
and $\hat{t}'$ are for another individual, with $t.s \ne \hat{t}.s$, $t.s = \hat{t}'.s$, and
$\hat{t}.s = t'.s$. Then we say that $D_2$ is a neighboring database to $D_1$
with respect to $t$.
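
A neighboring database in this sense can be built by a single swap. The sketch below is illustrative only; it assumes a dataset stored as a dictionary from a hypothetical individual id to a `(non_sensitive, sensitive)` pair, which is a representation chosen here and not taken from the paper.

```python
from collections import Counter

def neighbor_wrt(d, t_id, other_id):
    """Return a neighbor of d with respect to individual t_id, obtained by
    swapping the sensitive values of t_id and other_id.

    d maps an individual id to a (non_sensitive, sensitive) pair.
    """
    nsa_t, s_t = d[t_id]
    nsa_o, s_o = d[other_id]
    if s_t == s_o:
        # The definition requires the two sensitive values to differ.
        raise ValueError("the two tuples must have different sensitive values")
    d2 = dict(d)                    # leave the original dataset untouched
    d2[t_id] = (nsa_t, s_o)
    d2[other_id] = (nsa_o, s_t)
    return d2

def sensitive_counts(d):
    """Multiset of sensitive values in a dataset."""
    return Counter(s for _, s in d.values())
```

Because only a swap is performed, `sensitive_counts` is identical for the dataset and its neighbor, which is the property that the symmetric-difference formulation fails to preserve.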

Our definition of neighbors bears some resemblance to the concept
of Bounded Neighbors in [23], where the counts of tuples are
preserved. As in [23], our objective is a good choice of neighbors
of $D$ (the original dataset) which should be difficult to distinguish
from each other. Our differential privacy model retains the essence
of Definition 1 from [6].

Definition 3 (ℓ'-diverted privacy). We say that a non-interactive
database sanitization mechanism $A$ satisfies ℓ'-diverted
zero-differential privacy, or simply ℓ'-diverted privacy, if for any
given $D_1$ and any tuple $t$ in $D_1$, there exist $\ell' - 1$ neighboring
databases $D_2$ with respect to $t$, such that for all sanitized outputs
$\hat{D}$, $\Pr[A(D_1) = \hat{D}] = \Pr[A(D_2) = \hat{D}]$.

The above definition says that any individual may take on any of
ℓ' different sensitive values by swapping the sensitive information
with other individuals in the dataset, and it makes no difference in
the probability of generating any dataset $\hat{D}$. It seems that our definition
depends on the parameter ℓ'. However, not knowing which
$\ell' - 1$ neighboring databases are the ones in the definition, an adversary
will not be able to narrow down the possibilities. Therefore,
even in the case where the adversary knows all the information
about all individuals except for 2 individuals, there is still no
certainty in the values for the 2 individuals.

Our task is to find a mechanism that satisfies ℓ'-diverted zero-differential
privacy while at the same time supporting discriminative
utilities. The use of Laplace noise with distribution $Lap(f/\varepsilon)$ is
common in ε-differential privacy [12]. However, this approach
will introduce arbitrary noise when ε becomes zero, and it is designed
for interactive query answering. We need to derive a different
technique.

3. RANDOMIZATION: AN INITIAL ATTEMPT

In the search for a technique to guarantee a tapering accuracy for 
the estimated values from large counts to small counts, the law of 
large numbers [24] naturally comes to mind. Random perturbation
has been suggested in [30], the reason being that "If a query set
is sufficient large, the law of large numbers causes the error in the 
query to be significantly less than the perturbations of individual 
records." Indeed, we have seen the use of i.i.d. for the randomiza- 
tion of datasets with categorical attributes. In O, an identity per- 
turbation scheme for categorical sensitive values is proposed. This 
scheme keeps the original sensitive value in a tuple with a probabil- 
ity of p and randomly picks any other value to substitute for the true 
value with a probability of (1 — p), with equal probability for each 
such value. Theorem 1 in I?) states that their method can achieve 
good estimation for large dataset sizes. Therefore, it is fair to ask 
if this approach can solve our problem at hand. Unfortunately, as 
we shall show in the following, this method cannot guarantee zero- 
differentiality unless p is equal to 1/M with a domain size of M, 
which renders the generated data a totally random dataset. Let us
examine this approach in more detail.

Suppose that the tuple $t$ of an individual has sensitive value $t.s$
in $D$. The set of sensitive values is given by $\{s_1, \ldots, s_m\}$. We
generate a sanitized value for the individual by selecting $s_i$ with
probability $p_i$, so that

$$p_i = \begin{cases} p & \text{for } s_i = t.s \\ q & \text{for } s_i \ne t.s \end{cases}$$

where $\sum_i p_i = 1$.

Let us refer to the anonymization mechanism above by A. 
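
The perturbation scheme referred to as mechanism $A$ can be sketched as follows. This is an illustrative sketch under the description above; the function names and the use of a seeded generator are assumptions made here for reproducibility.

```python
import random

def identity_perturb(true_value, domain, p, rng):
    """Keep true_value with probability p; otherwise return one of the other
    domain values, each with equal probability q = (1 - p) / (m - 1)."""
    if rng.random() < p:
        return true_value
    others = [s for s in domain if s != true_value]
    return rng.choice(others)

def sanitize_column(values, domain, p, seed=0):
    """Apply the perturbation independently to every sensitive value."""
    rng = random.Random(seed)
    return [identity_perturb(v, domain, p, rng) for v in values]
```

Setting `p = 1 / len(domain)` makes every output value equally likely regardless of the true value, which is the only setting shown below to be zero-differential.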

Let $D'$ be a dataset published by $A$ which contains a tuple for
individual $I$. Consider two datasets $D_1$, $D_2$ which differ only in
the sensitive value for the single tuple for $I$.

We are interested in the probability $\Pr[A(D_1) = D']$ that $D'$ is
generated from $D_1$ by $A$, and $\Pr[A(D_2) = D']$. In particular, we
shall show that when $p = q$, $A$ is zero-differential.

Let the tuples in $D_1$ be $t^1_1, \ldots, t^1_N$. Let the tuples in $D_2$ be
$t^2_1, \ldots, t^2_N$.

Lemma 1. For mechanism $A$, if $p = q$, then $A$ satisfies zero-differential
privacy according to Definition 1 with neighboring
databases having a symmetric difference of at most 2.

Proof: Since all the non-sensitive values are preserved and only
the sensitive values may be altered by $A$, we consider the probability
that each tuple in $D_1$ or $D_2$ may generate the corresponding
sensitive value in $D'$. For $D_k$, $k \in \{1, 2\}$, let $p_k(t^k_i, s_j)$ be the
probability that $A$ will generate $s_j$ for tuple $t^k_i$.

Mechanism $A$ handles each tuple independently. Hence
$\Pr[A(D_k) = D']$ is a function of $p_k(t^k_i, s_j)$ for all $i$, $j$, and $k \in \{1, 2\}$:

$$\Pr[A(D_k) = D'] = f(p_k(t^k_1, s_1), p_k(t^k_1, s_2), \ldots, p_k(t^k_1, s_m), \ldots, p_k(t^k_N, s_1), \ldots, p_k(t^k_N, s_m))$$

Given a tuple $t$ with sensitive value $t.s$, the probability that a sensitive
value $s_j$ will be generated in $D'$ for $t$ depends only on the
value of $t.s$.

Without loss of generality, let $D_1$ and $D_2$ differ only in the sensitive
value for $t_r$. We have $p_1(t^1_i, s_j) = p_2(t^2_i, s_j)$ for all $j$ and all
$i \ne r$. If we set $p = q = 1/m$, then the probability of generating
any value given any original sensitive value will be the same.
Although $t^1_r.s \ne t^2_r.s$, we have $p_1(t^1_r, s_j) = p_2(t^2_r, s_j)$ for all $j$.
Hence $\Pr[A(D_1) = D'] = \Pr[A(D_2) = D']$ and mechanism $A$
satisfies zero-differential privacy. □

Lemma 2. For mechanism $A$, if $p \ne q$, then $A$ does not satisfy
zero-differential privacy according to Definition 1 with neighboring
databases having a symmetric difference of at most 2.

Proof: We prove by constructing a scenario where we are given
datasets $D_1$, $D_2$ differing in only one tuple, and a sanitized table
$D'$, such that $\Pr[A(D_1) = D'] \ne \Pr[A(D_2) = D']$. Consider the case
where all tuples are unique in terms of the non-sensitive attributes.

For $1 \le i \le N$, let $t'_i$ be the tuple in $D'$ corresponding to $t^1_i$, and
let $p_1(t_i) = p$ if $t^1_i.s = t'_i.s$, and $p_1(t_i) = q$ if
$t^1_i.s \ne t'_i.s$. We have $\Pr[A(D_1) = D'] = \prod_i p_1(t_i)$. Similarly
we define $p_2(t_i)$ for $1 \le i \le N$, so that $\Pr[A(D_2) = D'] = \prod_i p_2(t_i)$.
Furthermore, let $t^1_k.s = t'_k.s$ and $t^2_k.s \ne t'_k.s$. Therefore, for $t_k$,
$p_1(t_k) = p$ and $p_2(t_k) = q$, while $p_1(t_i) = p_2(t_i)$ for $i \ne k$. Hence

$$\frac{\Pr[A(D_1) = D']}{\Pr[A(D_2) = D']} = \frac{p}{q}$$

Since $p \ne q$, it follows that $\Pr[A(D_1) = D'] \ne \Pr[A(D_2) = D']$
and therefore $A$ does not satisfy zero-differential privacy. □
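
The ratio in this proof can be reproduced on a toy instance. The sketch below is an illustration with hypothetical names; it assumes, as in the proof, that tuples are matched one-to-one between the original and the published table, and that $q = (1 - p)/(m - 1)$ so that the selection probabilities sum to one.

```python
from math import isclose, prod

def generation_probability(original, published, domain_size, p):
    """Probability that mechanism A turns the sensitive column `original`
    into `published`, with tuples handled independently."""
    q = (1.0 - p) / (domain_size - 1)
    return prod(p if o == s else q for o, s in zip(original, published))

# D1 and D2 differ only in the last tuple; D' agrees with D1 there.
d1 = ["flu", "cold", "hiv"]
d2 = ["flu", "cold", "cold"]
d_published = ["flu", "cold", "hiv"]

def probability_ratio(p, domain_size=3):
    """Ratio Pr[A(D1) = D'] / Pr[A(D2) = D'] for the toy instance above."""
    return (generation_probability(d1, d_published, domain_size, p)
            / generation_probability(d2, d_published, domain_size, p))
```

For `p = 0.6` and a domain of three values, `q = 0.2` and the ratio is `p / q = 3`; for `p = 1/3` the two probabilities coincide.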

Lemma 3. For mechanism $A$, if $p \ne q$, then $A$ is not ℓ'-diverted
zero-differential according to Definition 3.

Proof: We prove by showing a scenario where, given $D_1$,
a neighboring database $D_2$ with respect to a tuple $t$, and an
anonymized table $D'$, $\Pr[A(D_1) = D'] \ne \Pr[A(D_2) = D']$.
Let $D_1$ and $D_2$ agree on all tuples except for $t^1_a$ and $t^1_b$ in $D_1$ and
the corresponding tuples $t^2_a$ and $t^2_b$ in $D_2$. Let all tuples be unique in
terms of the non-sensitive attributes.

For $1 \le i \le N$, let $p_1(t_i) = p$ if $t^1_i.s = t'_i.s$, and $p_1(t_i) = q$ if
$t^1_i.s \ne t'_i.s$. We have $\Pr[A(D_1) = D'] = \prod_i p_1(t_i)$. Similarly
we define $p_2(t_i)$ for $1 \le i \le N$, so that $\Pr[A(D_2) = D'] = \prod_i p_2(t_i)$.
Furthermore, let $t^1_a.s = t'_a.s$ and $t^1_b.s = t'_b.s$, and also $t^2_a.s \ne t'_a.s$
and $t^2_b.s \ne t'_b.s$. Therefore $p_1(t_a) = p$, $p_1(t_b) = p$ and
$p_2(t_a) = q$, $p_2(t_b) = q$, while $p_1(t_i) = p_2(t_i)$ for $i \notin \{a, b\}$. Hence

$$\frac{\Pr[A(D_1) = D']}{\Pr[A(D_2) = D']} = \frac{p^2}{q^2}$$

Since $p \ne q$, it follows that $\Pr[A(D_1) = D'] \ne \Pr[A(D_2) = D']$
and therefore $A$ is not zero-differential. □

From the previous analysis, in order to make the probability
$\Pr[A(D_1) = D']$ equal to $\Pr[A(D_2) = D']$, all values need
to be selected with probability equal to $1/m$. This would be the same
as random data, and it would come at a great cost in utility.

4. PROPOSED MECHANISM 

From the previous section on mechanism $A$, we see that for
generating a dataset $D'$ from a given dataset $D$, randomization
with uniform probability can attain zero-differentiality. However,
if the probability is uniform over the entire domain, the utility
will be very low. Here we introduce a simple mechanism called
$A'$ which introduces uniform probability over a subset of the domain.
We shall show that this mechanism satisfies ℓ'-diverted zero-differential
privacy without sacrificing too much utility. We make
the same assumption as in previous works [25, 34] that the dataset
is eligible, so that the highest frequency of any sensitive attribute
value does not exceed $N/\ell'$. Furthermore, we assume that $N$ is a
multiple of $\ell'$ (it is easy to ensure this by deleting no more than
$\ell' - 1$ tuples from the dataset).

4.1 Mechanism A' 

Mechanism $A'$ generates a dataset $D'$ given the dataset $D$. We
assume that there is a single sensitive attribute (SA) $S$ in $D$. We
shall show that $A'$ satisfies ℓ'-diverted zero-differential privacy.
There are four main steps for $A'$:

1. First we assume that the tuples in $D$ have been randomly assigned
unique tuple id's independent of their tuple contents.
Include the tuple id as an attribute $id$ in $D$. The first step
of $A'$ is an initialization step, whereby the dataset $D$ goes
through a projection operation on $id$ and the SA attribute
$S$. Let the resulting table be $D_s$. That is, $D_s = \Pi_{id,S}(D)$.
Note that the non-sensitive values have no influence on the
generation of $D_s$.

2. The set of tuples in $D_s$ is partitioned into sets of size $\ell'$ each
in such a way that in each partition, the sensitive value of
each tuple is unique. In other words, let there be $r$ partitions,
$P_1, \ldots, P_r$; in each partition $P_i$, there are $\ell'$ tuples and $\ell'$ different
sensitive values. We call each partition a decoy group.
If tuple $t$ is in $P_j$, we say that the elements in $P_j$ are the decoys
for $t$. We also refer to $P_j$ as $P(t)$. With a little abuse of
terminology, we also refer to the set of records in $D$ with the
same id's as the tuples in this decoy group as $P(t)$.

One can adopt some existing partitioning methods in the literature
of ℓ-diversity. We require that the method be deterministic.
That is, given a $D_s$, there is a unique partitioning
from this step.

3. For each given tuple $t$ in $D_s$, we determine the partition
$P(t)$. Let the sensitive values in $P(t)$ be $\{s'_1, \ldots, s'_{\ell'}\}$. For
each of these decoy values, there is a certain probability that
the value is selected for publication as the sensitive value for
$t$. For a value not in $\{s'_1, \ldots, s'_{\ell'}\}$, the probability of being
published as the value for $t$ is zero. In the following we shall
also refer to the set $\{s'_1, \ldots, s'_{\ell'}\}$ as $decoys(t)$.

Suppose that a tuple $t$ has sensitive value $t.s$ in $D$. Create
tuple $t'$ and initialize it to $t$. Next we generate a value to
replace the $S$ value in $t'$ by selecting $s_i$ with probability $p_i$,
so that

$p_i = p$ for $s_i = t.s$

$p_i = q = (1 - p)/(\ell' - 1)$ for $s_i \ne t.s$, $s_i \in decoys(t)$

$p_i = 0$ for $s_i \notin decoys(t)$

4. The set of tuples $t'$ created in the previous step forms a table
$D_s'$. Remove the $S$ column from $D$, resulting in $D_N$. Form
a new table $D'$ by joining $D_s'$ and $D_N$ and retaining only
NSA and $S$ in the join result. The tuples in $D'$ are shuffled
randomly. Finally $D'$ and $\ell'$ are published. □



Algorithm 1 - Mechanism A'

Require: D with N tuples, with random tuple id's, sensitive attribute
S, and set of non-sensitive attributes NSA
    table D_s ← Π_{id,S}(D)
    partition D_s into decoy groups of size ℓ' each
        so that each decoy group has ℓ' unique sensitive values
    for each partition P do
        for each tuple t in P do
            let decoys(t) = {s'_1, ..., s'_ℓ'}
            create tuple t' and set t'.id = t.id
            if t.s = s'_i then
                set t'.s = s'_i with probability p
                set t'.s to s'_j ≠ s'_i with probability q
    let D_s' be the set of tuples t' created in the above
    D' ← Π_{NSA,S}((Π_{id,NSA} D) ⋈ D_s')
    shuffle tuples in D' and publish D' and ℓ'
    {Note that no other information about the partitions is published}



The pseudocode for mechanism $A'$ is given in Algorithm 1.
At first glance, mechanism $A'$ looks similar to partitioning based
methods for ℓ-diversity [25]. In fact, for the second step in $A'$, we
can adopt an existing partitioning algorithm such as the one in [34],
which has been designed for bucketization. However, $A'$ differs
from these previous approaches in important ways.

Firstly, the generation of dataset $D'$ is based on a probabilistic
assignment of values to attributes in the tuples. There is a non-zero
probability that an SA (sensitive attribute) value that exists in $D$
does not exist in $D'$. In known partitioning based methods, the SA
values in $D'$ are honest recordings of the values in $D$, although in
some algorithms they may be placed in buckets separated from the
remaining values.

Secondly, the partitioning information is not released by $A'$, in
contrast to previous approaches, in which the anonymized groups
or buckets are made known in the data publication. For ℓ-diversity
methods, since the partitioning is known, each tuple has a limited
set of ℓ possible values. By withholding the partitioning information,
plus the possibility that a value existing in $D$ may not exist in
$D'$, there is essentially no limit to the possible values for $S$ except
for the entire domain for any tuple in $D'$.
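
The four steps of mechanism $A'$ can be sketched in code as follows. This is a minimal illustration, not the authors' implementation. The partitioning routine is an assumption made here: it greedily draws one tuple from each of the ℓ' most frequent remaining sensitive values, with ties broken by value and tuples taken in id order so that the result is deterministic. The mechanism only requires some deterministic partitioning into groups of ℓ' distinct values. The sketch also fixes $p = q = 1/\ell'$, and all function names are hypothetical.

```python
import random
from collections import defaultdict

def partition_decoy_groups(ds, l):
    """Deterministically split ds, a list of (id, s) pairs, into decoy groups
    of l tuples with l distinct sensitive values (assumed greedy strategy)."""
    buckets = defaultdict(list)
    for tid, s in sorted(ds):                # id order makes the result unique
        buckets[s].append(tid)
    groups, remaining = [], len(ds)
    while remaining > 0:
        # The l values with the most remaining tuples; ties broken by value.
        top = sorted(buckets, key=lambda s: (-len(buckets[s]), s))[:l]
        if len(top) < l or not all(buckets[s] for s in top):
            raise ValueError("dataset is not eligible for this group size")
        groups.append([(buckets[s].pop(0), s) for s in top])
        remaining -= l
    return groups

def mechanism_a_prime(d, l, rng):
    """d is a list of (id, nsa, s) tuples. Returns the shuffled published
    table as a list of (nsa, s') pairs; the partition is not returned."""
    ds = [(tid, s) for tid, _, s in d]       # step 1: project on id and S
    published = {}
    for group in partition_decoy_groups(ds, l):   # step 2: decoy groups
        decoys = [s for _, s in group]
        for tid, _ in group:                 # step 3: uniform over the decoys
            published[tid] = rng.choice(decoys)
    out = [(nsa, published[tid]) for tid, nsa, _ in d]   # step 4: join on id
    rng.shuffle(out)
    return out
```

Since the partition depends only on the (id, sensitive value) pairs, swapping the non-sensitive values of two tuples in the same decoy group leaves the partition unchanged, which is the property used in the privacy argument of Section 4.2.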



4.2 ℓ'-diverted zero-differentiality guarantee

For the privacy guarantee, we shall show that if $p = q$, then $A'$
satisfies ℓ'-diverted zero-differential privacy; otherwise, it does not.
First we need to state a fact about $A'$.

Fact 1. In mechanism $A'$, let $p = q = (1 - p)/(\ell' - 1)$, so that $p = q =
1/\ell'$. When executing $A'$, for two tuples $t$ and $t'$ in the same partition
($P(t) = P(t')$), and any sensitive value $s_i$, the probability that $t$
will be assigned $s_i$ by $A'$ is equal to that for $t'$.

The following theorem says that we should set $p = q = 1/\ell'$ in
mechanism $A'$.

Theorem 1. For mechanism $A'$, if $p = q = 1/\ell'$, then $A'$ satisfies
ℓ'-diverted zero-differential privacy.

Proof: Let $D'$ be a published dataset. Given a dataset $D_1$ which
may generate $D'$, and a tuple $t$ in $D_1$, we find $\ell' - 1$ neighboring
databases $D_2$ as follows.

We execute $A'$ on $D_1$. In the first step, $D^1_s$ is generated
from $D_1$. In the second step, $D^1_s$ is partitioned into sets of size $\ell'$.
Let $P(t)$ be the partition (decoy group) formed by $A'$ for $t$ in $D_1$.
Pick one element $\hat{t}$ in $P(t)$ where $\hat{t} \ne t$. Form $D_2$ by swapping
the non-sensitive values of $t$ and $\hat{t}$ in $D_1$. By definition, $D_2$ is a
neighboring database of $D_1$.

Let $D^2_s$ be the table generated from Step 1 of $A'$ on $D_2$. Since we
have only swapped the non-sensitive values of $t$ and $\hat{t}$, $D^1_s = D^2_s$.
The partitioning step of $A'$ is deterministic, meaning that the same
partitioning will be obtained for $D_1$ and $D_2$. From the above, we
know that $t$ and $\hat{t}$ are in the same partition for $D_2$ as well, i.e., $P(t) =
P(\hat{t})$. When we consider the generation of sensitive values for $t$ and
$\hat{t}$, since they are in the same partition $P(t)$, by Fact 1 they have the
same probabilities for different outcomes. Since the SA values for
different tuples are generated independently, $\Pr[A'(D_1) = D'] =
\Pr[A'(D_2) = D']$.

Since there are $\ell' - 1$ possible choices of $\hat{t}$, and hence of $D_2$, given
$D_1$, we have shown that $A'$ satisfies ℓ'-diverted zero-differential privacy. □

Theorem 2. For mechanism $A'$, if $p \ne q$, then $A'$ does not
satisfy ℓ'-diverted zero-differential privacy.

Proof: We prove by giving an instance where $A'$ is not ℓ'-diverted
zero-differential. We say that a dataset $D$ is ℓ'-consistent
with $D'$ if there is a non-zero probability that $D'$ is generated from
$D$ by $A'$. Consider $D_1$ and $D_2$, each being consistent with $D'$. Let
the tuples in $D_1$ be $t^1_1, \ldots, t^1_N$. Let the tuples in $D_2$ be $t^2_1, \ldots, t^2_N$.
The two sets of tuples are for the same set of individuals. Furthermore,
assume that $D_1$ and $D_2$ differ in only 2 tuples for a pair of
individuals; let the pair of tuples in $D_1$ be $t^1_a, t^1_b$, and that in $D_2$
be $t^2_a, t^2_b$. Assume that all tuples have unique non-sensitive values,
and

$$t^1_a.s = t'_a.s, \quad t^1_b.s = t'_b.s$$

$$t^1_a.s = t^2_b.s, \quad t^2_a.s = t^1_b.s, \quad t^1_a.s \ne t^1_b.s$$

For $1 \le i \le N$, let $p_1(t_i) = p$ if $t^1_i.s = t'_i.s$, and $p_1(t_i) = q$ if
$t^1_i.s \ne t'_i.s$. Similarly we define $p_2(t_i)$ for $1 \le i \le N$.

Therefore, $p_1(t_a) = p_1(t_b) = p$ and $p_2(t_a) = p_2(t_b) = q$,
while $p_1(t_i) = p_2(t_i)$ for $i \notin \{a, b\}$. Hence

$$\frac{\Pr[A'(D_1) = D']}{\Pr[A'(D_2) = D']} = \frac{\prod_i p_1(t_i)}{\prod_i p_2(t_i)} = \frac{p^2}{q^2}$$

Since $p \ne q$, it follows that $\Pr[A'(D_1) = D'] \ne \Pr[A'(D_2) = D']$
and therefore $A'$ is not zero-differential. □



The above theorems show that in order to enforce ℓ'-diverted
zero-differential privacy, we should set both $p$ and $q$ to $1/\ell'$. This
will be the assumption in our remaining discussions about $A'$.

5. AGGREGATE ESTIMATION 

In this section we examine how to answer counting queries for
the sensitive attribute based on the published dataset $D'$.

Let $|D| = N$, so that there are $N$ tuples in $D$. Consider a sensitive
value $s$. Let the true frequency of $s$ in $D$ be $f_s$. By mechanism
$A'$, there will be $f_s$ decoy groups which contain $s$ in the decoy
value sets. Each tuple in these groups has a probability of $p = 1/\ell'$
of being assigned $s$ in $D'$. The probability that it is assigned a
value other than $s$ is $1 - p$. There are $f_s \ell'$ such tuples.

Let $N'_s$ denote the number of times that $s$ is published in $D'$. The
random variable $N'_s$ has the binomial distribution with parameters
$f_s \ell'$ and $p$:

$$P[N'_s = x] = \binom{f_s \ell'}{x} p^x (1 - p)^{f_s \ell' - x}$$

The expected value is $f_s \ell' p$, and the variance is $\sigma^2 = f_s \ell' p (1 - p)$.

Since we set $p = q = 1/\ell'$, the expected count of $s$ in $D'$ is
given by

$$e_s = p \, \ell' f_s = f_s$$

That is, to estimate the true count of an SA value $s$, we simply take
the count of $s$ in $D'$, $f'_s$.
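
The distribution of the published count can be evaluated exactly for small cases. The sketch below is illustrative; it assumes $p = 1/\ell'$ as set in Section 4.2, uses exact rational arithmetic, and the function names are hypothetical.

```python
from fractions import Fraction
from math import comb

def published_count_pmf(f_s, l, x):
    """P[N'_s = x]: probability that s is published exactly x times, where
    f_s * l tuples each receive s independently with probability 1/l."""
    n, p = f_s * l, Fraction(1, l)
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def published_count_mean_var(f_s, l):
    """Mean and variance of the binomial published count."""
    n, p = f_s * l, Fraction(1, l)
    return n * p, n * p * (1 - p)
```

For example, with `f_s = 3` and `l = 4` the mean is 3 and the variance is 9/4, so the published count is an unbiased estimate of the true count.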

Theorem 3. The estimation of $f_s$ by $f'_s$ is a maximum likelihood
estimation (MLE).

Proof. Let $L(D)$ be the likelihood of the observation $f'_s$ in $D'$,
given the original dataset $D$: $L(D) = \Pr(f'_s \mid D)$.

From mechanism $A'$, given $f_s$ occurrences of $s$ in $D$, there will
be exactly $\ell' f_s$ tuples that generate $s$ in $D'$ with a probability of $p$.
The remaining tuples have zero probability of generating an $s$ value.
The probability that $f'_s$ occurrences of $s$ are generated in $D'$ is given
by

$$L(D) = \Pr(f'_s \mid D) = \binom{\ell' f_s}{f'_s} p^{f'_s} (1 - p)^{\ell' f_s - f'_s}$$

where $p = 1/\ell'$.

This is a binomial distribution function, which is maximized
when $f'_s$ is at the mean value $\ell' f_s p = f_s$. □
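
The claim used in this proof, that the binomial probability is largest when the published count equals $f_s$, can be checked by brute force for small parameters. This is only a numerical sanity check under the assumption $p = 1/\ell'$ with $\ell' \ge 2$; the function name is hypothetical.

```python
from fractions import Fraction
from math import comb

def most_likely_published_count(f_s, l):
    """Value of f'_s that maximizes the binomial probability with
    n = f_s * l trials and success probability 1/l."""
    n, p = f_s * l, Fraction(1, l)

    def pmf(x):
        return comb(n, x) * p ** x * (1 - p) ** (n - x)

    return max(range(n + 1), key=pmf)
```

For every small `f_s` and `l` tried, the maximizer coincides with `f_s`, in line with the mode of a binomial distribution being the integer part of $(n + 1)p$.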

To examine the utility of the dataset $D'$, we ask how likely it is that $f'_s$ is
close to $f_s$, and hence that the estimate $e_s$ is close to the true count
$f_s$. However, we also need to provide protection for small counts.
In the next section we shall analyze these properties of the published
dataset.

6. PRIVACY, UTILITY, AND THE SUM 

As discussed in Section 1, the utility of the dataset must be
bounded so that for certain facts, in particular, those that involve
very few individuals, the published data should provide sufficient
protection. Here we consider the relationship between the utility
and the number of tuples $n$ that is related to a sensitive value. Is
it possible to balance between disclosing useful information where
$n$ is large and hence safe, and not disclosing accurate information
where $n$ is small and hence needs protection? We explore these issues
in the following.



6.1 Utility for large sums 

To answer the question about the utility for large sums, we make use of Chebychev's inequality, which gives a bound for the likelihood that an observed value deviates from its mean.

Chebychev's Theorem: If $X$ is a random variable with mean $\mu$ and standard deviation $\sigma$, then for any positive $k$, $\Pr(|X - \mu| < k\sigma) \ge 1 - \frac{1}{k^2}$ and $\Pr(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}$. □

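Let $X_1, X_2, \ldots, X_n, \ldots$ be a sequence of independent, identically distributed random variables, each with mean $\mu$ and variance $\sigma^2$. Define the new sequence of $\bar{X}_n$ values by

$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad n = 1, 2, 3, \ldots$$

From Chebychev's inequality, $P[|\bar{X}_n - \mu_{\bar{X}_n}| \ge k\sigma_{\bar{X}_n}] \le \frac{1}{k^2}$, where $\mu_{\bar{X}_n} = E[\bar{X}_n] = \mu$, $\sigma^2_{\bar{X}_n} = E[(\bar{X}_n - \mu)^2] = \frac{\sigma^2}{n}$ and $k$ is any positive real number. Choosing $k = \frac{e}{\sigma_{\bar{X}_n}}$ for some $e > 0$, we get

$$\Pr[|\bar{X}_n - \mu| \ge e] \le \frac{\sigma^2}{n e^2} \qquad (1)$$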

The above reasoning has been used to prove the law of large numbers. Let us see how it can help us to derive the utility of our published data for large sums. If there are $f_s$ tuples with $s$ value, then $n = \ell' f_s$ tuples in $D$ will have a probability of $p$ to be assigned $s$ in $D'$. The setting of value $s$ to the tuples in $D'$ corresponds to a sequence of $\ell' f_s$ independent Bernoulli random variables, $X_1, \ldots, X_n$, each with parameter $p$. Here $X_i = 1$ corresponds to the event that $s$ is chosen for the $i$-th tuple, while $X_i = 0$ corresponds to the case where $s$ is not chosen.



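The mean value is $\mu_X = p$. Also, $\sigma^2 = p(1-p)/n$. From Inequality (1),

$$\Pr[|\bar{X}_n - \mu| \ge e] \le \frac{p(1-p)}{e^2 n^2} = \frac{p(1-p)}{e^2 \ell'^2 f_s^2}$$

As set earlier, $p = \frac{1}{\ell'}$, hence

$$\Pr[|\bar{X}_n - \mu| \ge e] \le \frac{\ell' - 1}{e^2 \ell'^4 f_s^2} \qquad (2)$$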

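Note that $\bar{X}_n$ is the count of $s$ in $D'$ divided by $n$, and $n = \ell' f_s$. Hence the number of occurrences of $s$ in $D'$ is $f'_s = \ell' f_s \bar{X}_n$. Rewriting Inequality (2) we get

$$\Pr[|\ell' f_s \bar{X}_n - \ell' f_s \mu| \ge \ell' f_s e] \le \frac{\ell' - 1}{e^2 \ell'^4 f_s^2}$$

Since $\mu = p = 1/\ell'$,

$$\Pr[|f'_s - f_s| \ge \ell' e f_s] \le \frac{\ell' - 1}{e^2 \ell'^4 f_s^2}$$

With the above inequality, we are interested in how different $f'_s$ is from $f_s$. Since the deviation is bounded by $\ell' e f_s$, it is better to use another variable $\varepsilon = \ell' e$ to quantify the difference.

$$\Pr[|f'_s - f_s| \ge \varepsilon f_s] \le \frac{\ell' - 1}{\varepsilon^2 \ell'^2 f_s^2} \qquad (3)$$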



Our estimation is $e_s = f'_s$, hence the above gives a bound on the probability of error in our estimation. If $f_s$ is small, then the bound is large. In other words, the utility is not guaranteed. This is our desired effect.

Given a desired $\varepsilon$ and a desired $\ell'$, we may find a frequency threshold $T_f$ so that for $f_s$ above this threshold, the probability of error in Inequality (3) is below another threshold $T_E$ for utility. We can set the RHS in the above inequality to be this threshold. Obviously, $\frac{\ell' - 1}{\varepsilon^2 \ell'^2 f_s^2} \le T_E$ for $f_s \ge T_f$.
Figure 1: Relationship between $T_E$ and $T_f$

Definition 4 (Thresholds $T_E$ and $T_f$). Given an original dataset $D$ and an anonymized dataset $D'$, a value $s$ has a $(\varepsilon, T_E, T_f)$ utility guarantee if, for a frequency $f_s$ of $s$ above the frequency threshold $T_f$ in $D$,

$$\Pr[|f'_s - f_s| \ge \varepsilon f_s] \le T_E \quad \text{for } f_s \ge T_f \qquad (4)$$

The above definition says that a value $s$ has a $(\varepsilon, T_E, T_f)$ guarantee if, whenever the frequency $f_s$ of $s$ is above $T_f$ in $D$, the probability of a relative error of more than $\varepsilon$ is at most $T_E$.

Lemma 4. Mechanism A' provides a $(\varepsilon, T_E, T_f)$ utility guarantee for each sensitive value, where

$$T_E = \frac{\ell' - 1}{\varepsilon^2 \ell'^2 T_f^2}$$

Hence, given $\varepsilon$ and $T_E$, we can determine the smallest count which can provide the utility guarantee.

Example 1. Consider some possible values for the parameters. Suppose $T_E = 0.02$ and $\ell' = 10$. If $\varepsilon = 0.2$, then $T_f = 11$.
Ife = 0.001, or e = 0.02 then Tf = 49. g 
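
The threshold in Example 1 can be reproduced with a few lines of Python. This is a minimal sketch that assumes the right-hand side of Inequality (3) has the form $(\ell' - 1)/(\varepsilon^2 \ell'^2 f_s^2)$, obtained by substituting $p = 1/\ell'$ and $\varepsilon = \ell' e$ into the bound $p(1-p)/(e^2 \ell'^2 f_s^2)$. The function names are illustrative.

```python
import math


def error_bound(f_s, eps, l_prime):
    """Assumed right-hand side of Inequality (3) for a true count f_s."""
    return (l_prime - 1) / (eps ** 2 * l_prime ** 2 * f_s ** 2)


def frequency_threshold(eps, t_e, l_prime):
    """Smallest integer count whose bound does not exceed t_e (hypothetical helper)."""
    # Start just below the real-valued solution, then step up to the first valid integer.
    f = max(1, math.floor(math.sqrt((l_prime - 1) / t_e) / (eps * l_prime)))
    while error_bound(f, eps, l_prime) > t_e:
        f += 1
    return f


if __name__ == "__main__":
    print(frequency_threshold(0.2, 0.02, 10))
```

With $T_E = 0.02$, $\ell' = 10$ and $\varepsilon = 0.2$ the sketch returns 11, matching the first case of Example 1.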

Figure 1 shows the relationship between the possible values of $T_f$ and $T_E$. The utility is better for small $T_E$, and the value of $T_E$ becomes very small as the count increases towards 900. Note that utility is the other side of privacy breach: for concepts with large counts, privacy protection is not guaranteed, since the accuracy of the count will be high.

6.2 Privacy for small sums 

Next we show how our mechanism can inherently provide protection for small counts. From Inequality (3), small values of $f_s$ will weaken the guarantee of utility. We can in fact give a probability for relative errors based on the following analysis.

The number of occurrences of $s$ in $D'$ is the total number of successes in $f_s \ell'$ repeated independent Bernoulli trials with probability $\frac{1}{\ell'}$ of success on a given trial. It is the binomial random variable with parameters $n = \ell' f_s$ and $p = \frac{1}{\ell'}$. With $q = 1 - p$, the probability that this number is $x$ is given by

$$\binom{n}{x} p^x q^{n-x} = \binom{\ell' f_s}{x} \left(\frac{1}{\ell'}\right)^{x} \left(1 - \frac{1}{\ell'}\right)^{\ell' f_s - x}$$

Example 2. If $f_s = 5$ and $\ell' = 10$, then for an $\varepsilon = 0.3$ bound on the relative error, we are interested in how likely it is that $f'_s$ is within a deviation of 1 from 5. The probability that $f'_s$ is between 4 and 6 is given by

$$\sum_{x=4}^{6} \binom{50}{x} (0.1)^{x} (0.9)^{50-x} \approx 0.52$$

Hence the probability that $f'_s$ deviates from $f_s$ by more than $0.3 f_s$ is about 0.48.

Figure 2: Expected error for small sums

Definition 5 (privacy guarantee). We say that a sensitive value $s$ has a $(\varepsilon, T_P)$ privacy guarantee if the probability that the estimated count of $s$, $f'_s$, has a relative error of more than $\varepsilon$ is at least $T_P$.

In Example 2, the value $s$ has a (0.3, 0.48) privacy guarantee. A graph is plotted in Figure 2 for the expected error for small values of $f_s$. Here the summation in the above probability is taken from $f'_s = \lceil 0.7 f_s \rceil$ to $f'_s = \lfloor 1.3 f_s \rfloor$. We have plotted for different values the probability given by

$$1 - \sum_{x=\lceil 0.7 f_s \rceil}^{\lfloor 1.3 f_s \rfloor} \binom{\ell' f_s}{x} \left(\frac{1}{\ell'}\right)^{x} \left(1 - \frac{1}{\ell'}\right)^{\ell' f_s - x}$$

This graph shows that the relative error in the count estimation is expected to be large for sensitive values with small counts.
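
The probability used in Example 2 and in the privacy guarantee can be evaluated directly from the binomial distribution. The Python sketch below computes the probability that the published count falls outside the range $\lceil (1-\varepsilon) f_s \rceil$ to $\lfloor (1+\varepsilon) f_s \rfloor$. The function name is an assumption made for illustration.

```python
from math import ceil, comb, floor


def prob_large_error(f_s, l_prime, eps):
    """Probability that the published count of s has relative error above eps."""
    n = l_prime * f_s          # number of tuples that may be published with s
    p = 1.0 / l_prime          # probability that each such tuple is published with s
    lo = ceil((1 - eps) * f_s)
    hi = floor((1 + eps) * f_s)
    inside = sum(comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(lo, hi + 1))
    return 1.0 - inside


if __name__ == "__main__":
    # Setting of Example 2: f_s = 5, l' = 10, eps = 0.3.
    print(round(prob_large_error(5, 10, 0.3), 2))
```

For the setting of Example 2 the sum runs over counts 4 to 6, and the resulting probability of a large error is about 0.48.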

7. MULTIPLE ATTRIBUTE PREDICATES 

In this section we consider the counts for sets of values. For ex- 
ample, we may want to know the count of tuples with both lung 
cancer and smoking, or the count of tuples with gender = female, 
age = 60 and disease = allergy. The problem here is counting the 
occurrences of values of an attribute set. Firstly we shall consider 
counts for predicates involving a single sensitive attribute, then we 
extend our discussion to predicates involving multiple sensitive at- 
tributes. 

7.1 Predicates involving a single SA 

Assume that we have a set of non-sensitive attributes $NSA$ and a single sensitive attribute $SA$. Let us consider queries involving both $NSA$ and $SA$. We may divide such a query into two components, $P$ and $s$, where $P \in domain(NA)$ ($NA \subseteq NSA$) and $s \in domain(SA)$. For example, $P = (\text{female}, 60)$ and $s = (\text{allergy})$. Note that the non-sensitive attributes are not distorted in the published dataset. This can be seen as a special case of generating a non-sensitive value for the individual $t$ by selecting $s_i$ with probability $p_i$, so that

$$p_i = \begin{cases} 1 & \text{for } s_i = t.s; \\ 0 & \text{for } s_i \ne t.s. \end{cases}$$



Suppose we are interested in the count of the co-occurrences of non-sensitive values $P$ and SA value $s$.

Definition 6 (state i). There are 4 conjunctive predicates concerning $P$ and $s$, namely, $p_0 = \bar{P} \wedge \bar{s}$, $p_1 = \bar{P} \wedge s$, $p_2 = P \wedge \bar{s}$, and $p_3 = P \wedge s$. If a tuple satisfies $p_i$, we say that it is at state $i$.



The distributions of the predicates in $D$ and $D'$ are given by $cnt(p_i)$ and $cnt'(p_i)$, respectively. Here $cnt(p_i)$ ($cnt'(p_i)$) is the number of tuples satisfying $p_i$ in $D$ ($D'$).

For simplicity we let $x_i = cnt(p_i)$ and $y_i = cnt'(p_i)$; hence the a priori distribution concerning the states in $D$ is given by $x = \{x_0, x_1, x_2, x_3\}$, and the distribution in $D'$ is given by $y = \{y_0, y_1, y_2, y_3\}$. Hence $y$ contains the observed frequencies.

Definition 7 (Transition Matrix M). The probability of transition for a tuple from an initial state $i$ in $D$ to a state $j$ in $D'$ is given by $a_{ij}$. The values $a_{ij}$ form a transition matrix $M$.

The values of $a_{ij}$ are given in Figure 3.
Figure 3: State transition probabilities



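Let $\Pr(r_i \mid x)$ be the probability that a tuple has state $i$ in $D'$ given vector $x$ for the initial state distribution. The following can be derived.

$$\Pr(r_0 \mid x) = \frac{1}{N}\left(\left(1 - \frac{x_1 + x_3}{N}\right) x_0 + \frac{\ell' - 1}{\ell'}\, x_1\right) \qquad (6)$$

$$\Pr(r_1 \mid x) = \frac{1}{N}\left(\frac{x_1 + x_3}{N}\, x_0 + \frac{1}{\ell'}\, x_1\right) \qquad (7)$$

$$\Pr(r_2 \mid x) = \frac{1}{N}\left(\left(1 - \frac{x_1 + x_3}{N}\right) x_2 + \frac{\ell' - 1}{\ell'}\, x_3\right) \qquad (8)$$

$$\Pr(r_3 \mid x) = \frac{1}{N}\left(\frac{x_1 + x_3}{N}\, x_2 + \frac{1}{\ell'}\, x_3\right) \qquad (9)$$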


The above equations are based on the mechanism generating $D'$ from $D$. Let us consider the last equation; the other equations are derived in a similar manner. For each true occurrence of $(P, s)$, there is a $\frac{1}{\ell'}$ probability that it will generate such an occurrence in $D'$. If there are $x_3$ such tuples, then the expected number of generated instances will be $x_3/\ell'$.

Other occurrences of $(P, s)$ in $D'$ may be generated by the $x_2$ tuples satisfying $P$ but with $t.s \ne s$, i.e., $(P, \bar{s})$. Each such tuple $t$ satisfies $P$ for the non-sensitive values, and it is possible that $s \in decoys(t)$. We are interested in how likely it is that $s \in decoys(t)$.

There are in total $\frac{N}{\ell'}$ partitions. There can be at most one $s$ tuple in each partition. Hence $f_s$ of the partitions contain $s$ in the decoy set, and if a tuple $t$ is in such a partition, then $s \in decoys(t)$. The probability of having $s$ in $decoys(t)$ for a tuple $t$ with $t.s \ne s$ is the probability that $t$ is in one of the $f_s$ partitions above. Since mechanism A' does not consider the NSA values in the randomization process, all such tuples $t$ have equal probability of being in any of the $f_s$ partitions, and the probability is given by $f_s / \frac{N}{\ell'} = f_s \frac{\ell'}{N}$. Since $f_s = x_1 + x_3$, this probability is $\frac{x_1 + x_3}{N}\,\ell'$.



The total expected occurrence of $(P, s)$ is given by

$$\frac{x_3}{\ell'} + \left(\frac{x_1 + x_3}{N}\,\ell'\right)\frac{x_2}{\ell'}$$

We can convert this into a conditional probability that a tuple in $D'$ satisfies $(P, s)$ given $x$, denoted by $\Pr(r_3 \mid x)$. This gives Equation (9).

Rewriting Equations (6) to (9) with the transition probabilities in Figure 3 gives the following:

$$\Pr(r_i \mid x) = \sum_{j=0}^{3} a_{ji}\,\frac{x_j}{N} \qquad (10)$$

Equation (10) shows that $a_{ji}$ is the probability of transition for a tuple from an initial state $j$ in $D$ to a state $i$ in $D'$.

We adopt the iterative Bayesian technique for the estimation of the counts $x_0, \ldots, x_3$. This method is similar to the technique in [4] for reconstructing multiple column aggregates.

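Let the original states of tuples $t_1, \ldots, t_N$ in $D$ be $U_1, \ldots, U_N$, respectively. Let the states of the corresponding tuples in $D'$ be $V_1, \ldots, V_N$. From Bayes rule, we have

$$\Pr(U_k = i \mid V_k = j) = \frac{\Pr(V_k = j \mid U_k = i)\,\Pr(U_k = i)}{\Pr(V_k = j)}$$

Since $\Pr(U_k = i) = x_i/N$ and $\Pr(V_k = j \mid U_k = i) = a_{ij}$,

$$\Pr(U_k = i \mid V_k = j) = \frac{a_{ij}\,\frac{x_i}{N}}{\sum_{r=0}^{3} a_{rj}\,\frac{x_r}{N}} \qquad (11)$$

$$\Pr(U_k = i) = \sum_{j=0}^{3} \Pr(V_k = j)\,\Pr(U_k = i \mid V_k = j)$$

Hence, since $\Pr(V_k = j) = y_j/N$, $\Pr(U_k = i) = x_i/N$ and from Equation (11), we have

$$\frac{x_i}{N} = \sum_{j=0}^{3} \frac{y_j}{N}\,\frac{a_{ij}\,\frac{x_i}{N}}{\sum_{r=0}^{3} a_{rj}\,\frac{x_r}{N}}$$

We iteratively update $x$ by the following equation

$$x_i^{t+1} = \sum_{j=0}^{3} y_j\,\frac{a_{ij}^{t}\,x_i^{t}}{\sum_{r=0}^{3} a_{rj}^{t}\,x_r^{t}} \qquad (12)$$

We initialize $x^0 = y$, and $x^t$ is the value of $x$ at iteration $t$. In Equation (12), $a_{ij}^{t}$ refers to the value of $a_{ij}$ at iteration $t$, meaning that the value of $a_{ij}^{t}$ depends on setting the values of $x = x^t$. We iterate until $x^{t+1}$ does not differ much from $x^t$. The value of $x$ at this fixed point is taken as the estimated $x$ values. In particular, $x_3$ is the estimated count of $(P, s)$.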
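
The iterative update can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. It assumes the transition probabilities implied by the derivation above: a tuple without $s$ is published with $s$ with probability $(x_1 + x_3)/N$, and a tuple with $s$ keeps it with probability $1/\ell'$. The function names, the tolerance `tol` and the iteration cap `max_iter` are hypothetical.

```python
def transition_matrix(x, l_prime):
    """4x4 matrix a[i][j]: probability of moving from state i in D to state j in D'.

    States follow Definition 6: 0 = (not P, not s), 1 = (not P, s), 2 = (P, not s), 3 = (P, s).
    """
    n = sum(x)
    gain = (x[1] + x[3]) / n   # assumed: a non-s tuple is published with s
    keep = 1.0 / l_prime       # assumed: an s tuple is published with s
    return [
        [1 - gain, gain, 0.0, 0.0],
        [1 - keep, keep, 0.0, 0.0],
        [0.0, 0.0, 1 - gain, gain],
        [0.0, 0.0, 1 - keep, keep],
    ]


def bayes_step(x, y, l_prime):
    """One application of the iterative Bayesian update to the current estimate x."""
    a = transition_matrix(x, l_prime)
    denom = [sum(a[r][j] * x[r] for r in range(4)) for j in range(4)]
    return [
        sum(y[j] * a[i][j] * x[i] / denom[j] for j in range(4) if denom[j] > 0)
        for i in range(4)
    ]


def reconstruct(y, l_prime, tol=1e-6, max_iter=10000):
    """Iterate from x = y until successive estimates differ by at most tol."""
    x = [float(v) for v in y]
    for _ in range(max_iter):
        new_x = bayes_step(x, y, l_prime)
        if max(abs(u - v) for u, v in zip(new_x, x)) <= tol:
            return new_x
        x = new_x
    return x
```

Each update preserves the total number of tuples, and the counts of tuples satisfying $P$ stay fixed because the non-sensitive attributes are not distorted.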

For the multiple attribute predicate counts, we also guarantee that privacy for small sums will not be jeopardized.

Lemma 5. Let $s$ be a sensitive value with a $(\varepsilon, T_P)$ privacy guarantee; then the count for a multiple column aggregate involving $s$ also has the same privacy guarantee.

Proof: Without loss of generality, consider a multiple attribute aggregate of $(P, s)$, where $P \in domain(NSA)$. Since the randomization of $s$ is independent of the NSA attributes, the expected relative error introduced for $(P, s)$ is the same as that for $(\bar{P}, s)$. The total expected error for $(P, s)$ and $(\bar{P}, s)$ must not be less than that dictated by the $(\varepsilon, T_P)$ guarantee, since otherwise the sum of the two counts would generate a better estimate for the count of $s$, violating the $(\varepsilon, T_P)$ privacy for $s$. Hence for $(P, s)$ the privacy guarantee is at least $(\varepsilon, T_P)$. □



7.2 Multiple sensitive attributes 

So far we have considered that there is a single sensitive attribute in the given dataset. Suppose that instead of a single sensitive attribute (SA) there are multiple SAs; let the sensitive attributes be $S_1, S_2, \ldots, S_w$. We can generalize the randomization process by treating each SA independently, building decoy sets for each $S_i$.

For predicates involving $\{P, s_1, s_2, \ldots, s_w\}$, where $P$ is a set of values for a set of non-sensitive attributes and $s_i \in domain(S_i)$, there will be $K = 2^{w+1}$ different possible states for each tuple. We let $(P, s_1, s_2, \ldots, s_w)$ stand for $(P \wedge s_1 \wedge s_2 \ldots \wedge s_w)$. For reconstruction of the count for $(P, s_1, s_2, \ldots, s_w)$, we form a transition matrix for all the $K = 2^{w+1}$ possible states. It is easy to see that the case of a single SA in Section 7.1 is a special case where the transition matrix $M$ is the tensor product of two matrices $M_0$ and $M_1$, $M = M_0 \otimes M_1$, where $M_0$ is for the set of non-sensitive values and $M_1$ is for $s_1$, and they are defined as follows:

$$M_0 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad M_1 = \begin{pmatrix} 1 - \frac{f_{s_1}}{N} & \frac{f_{s_1}}{N} \\ 1 - \frac{1}{\ell'} & \frac{1}{\ell'} \end{pmatrix}$$

In general, with sensitive attributes $S_1, \ldots, S_w$, the transition matrix is given by $M = M_0 \otimes M_1 \otimes \cdots \otimes M_w$.
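
The tensor-product construction can be illustrated with a small pure-Python Kronecker product. This is a sketch under stated assumptions: the per-attribute matrix has rows (not $s$, $s$) with entries $1 - f_s/N$, $f_s/N$ in the first row and $1 - 1/\ell'$, $1/\ell'$ in the second, and the helper names `kron` and `sa_matrix` are hypothetical.

```python
def kron(a, b):
    """Kronecker (tensor) product of two matrices given as lists of rows."""
    return [
        [a_val * b_val for a_val in row_a for b_val in row_b]
        for row_a in a
        for row_b in b
    ]


def sa_matrix(f_s, n, l_prime):
    """Assumed 2x2 transition matrix for one sensitive value with true count f_s."""
    return [
        [1 - f_s / n, f_s / n],
        [1 - 1 / l_prime, 1 / l_prime],
    ]


M0 = [[1.0, 0.0], [0.0, 1.0]]  # non-sensitive values are published unchanged

# Single SA: a 4x4 matrix over the states of Definition 6.
M = kron(M0, sa_matrix(100, 1000, 5))

# Two SAs: an 8x8 matrix, built by one more tensor product.
M2 = kron(M, sa_matrix(40, 1000, 5))
```

Because every factor is row-stochastic, each row of the product also sums to 1, which gives a quick sanity check on the construction.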

Let the entries in matrix $M$ be given by $m_{ij}$. We initialize $x^0 = y$ and iteratively update $x$ by the following equation

$$x_i^{t+1} = \sum_{j=0}^{K-1} y_j\,\frac{m_{ij}^{t}\,x_i^{t}}{\sum_{r=0}^{K-1} m_{rj}^{t}\,x_r^{t}} \qquad (13)$$

In Equation (13), $x^t$ is the value of $x$ at iteration $t$, and $m_{ij}^{t}$ refers to the value of $m_{ij}$ at iteration $t$, meaning that the value of $m_{ij}^{t}$ depends on setting the values of $x = x^t$. We iterate until $x^{t+1}$ does not differ much from $x^t$. The value of $x$ at this fixed point is taken as the estimated $x$ values. In particular, $x_{K-1}$ is the estimated count of $(P, s_1, \ldots, s_w)$.

8. BELIEF ABOUT AN INDIVIDUAL 

An adversary may be armed with auxiliary knowledge in the at- 
tack on the sensitive value of an individual. In general auxiliary 
knowledge allows an adversary to rule out possibilities and sharpen 
their belief about the sensitive value of an individual. For example, 
a linkage attack refers to an attack with the help of knowledge about 
another database which is linked to the published data. The other 
database could be a voter registration list, and it has been discovered that the values of birthdate, sex and zip code alone are often sufficient to identify an individual [29, 28].

In the design of $\ell$-diversity [25], the set of tuples is divided into blocks and there should be $\ell$ well represented sensitive values in each block. The adversary needs $\ell - 1$ damaging pieces of auxiliary knowledge to eliminate $\ell - 1$ possible sensitive values and uncover the private information of an individual. Our method is an improvement over the $\ell$-diversity model since the set of possible sensitive values in our case is the entire domain of the sensitive attribute, including values that do not appear in the dataset. Hence if the domain size is $m$, the adversary would need $m - 1$ pieces of auxiliary knowledge to rule out $m - 1$ possible values, but in that case the adversary knows a priori the exact value without examining $D'$.

Another form of auxiliary knowledge is knowledge about the sanitization mechanism. Since many known approaches aim to minimize the distortion to the data, they suffer from the minimality attack [31]. Our method does not involve any distortion minimization step, and therefore the minimality attack is not applicable.



9. EMPIRICAL STUDY 

We have implemented our mechanism A' and compared it with some existing techniques that are related in some way to our method.

For step 2 of mechanism A', we need to partition the tuples in $D_s$ into sets of size $\ell'$ each, where each partition contains $\ell'$ different sensitive values. We have adopted the group creation step in the algorithm for Anatomy [34]. In this algorithm, all tuples of the given table are hashed into buckets by their sensitive values, so that each bucket contains tuples with the same SA value. The group creation step consists of multiple iterations. In each iteration a partition (group) with $\ell'$ tuples is created. Each iteration has two sub-steps: (1) find the set $L$ of the $\ell'$ hash buckets that currently have the largest number of tuples; (2) from each bucket in $L$, randomly select a tuple to be included in the newly formed partition. Note that the random selection in step (2) can be made deterministic by picking the tuple with the smallest tuple id.
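
The group creation step described above can be sketched in Python as follows. This is an illustrative reading of the two sub-steps, using the deterministic smallest-id selection. The function name and the input format are assumptions, and tuples left over when fewer than $\ell'$ buckets remain non-empty are simply returned as a residue, since their handling is not described here.

```python
from collections import defaultdict


def create_groups(tuples, l_prime):
    """Partition (tuple_id, sa_value) pairs into groups of l_prime distinct SA values.

    Returns (groups, residue); residue holds tuples that could not be grouped.
    """
    buckets = defaultdict(list)
    for tuple_id, sa_value in tuples:
        buckets[sa_value].append(tuple_id)
    for ids in buckets.values():
        ids.sort(reverse=True)  # smallest id sits at the end, ready for pop()

    groups = []
    while True:
        nonempty = [sa for sa, ids in buckets.items() if ids]
        if len(nonempty) < l_prime:
            break
        # Sub-step (1): the l_prime buckets with the most remaining tuples
        # (ties broken by the string form of the value, an arbitrary choice).
        largest = sorted(nonempty, key=lambda sa: (-len(buckets[sa]), str(sa)))[:l_prime]
        # Sub-step (2): take the tuple with the smallest id from each chosen bucket.
        groups.append([(buckets[sa].pop(), sa) for sa in largest])

    residue = [(tid, sa) for sa, ids in buckets.items() for tid in ids]
    return groups, residue
```

For example, nine tuples spread evenly over three sensitive values and $\ell' = 3$ yield three groups, each containing three distinct sensitive values, with no residue.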

9.1 Experimental setup 

The experiments evaluate both the effectiveness and the efficiency of mechanism A' for $\ell'$-diverted privacy. We also compare our method with three other approaches: Anatomy for $\ell$-diversity, differential privacy by means of Laplacian perturbation, and global randomization (mechanism A). Our code is written in C++ and executed on a PC with a Core(TM) i3 3.10 GHz CPU and 4.0 GB RAM. The dataset is generated by randomly sampling 500k tuples from the CENSUS dataset, which contains information on American adults. We further produce five datasets from the 500k dataset, with cardinalities ranging from 100k to 500k. The default cardinality is 100k. Occupation is chosen as the sensitive attribute, which involves 50 distinct values.

In the experiment we consider count queries, which have been used for utility studies for partition-based methods [34] and randomization-based methods [27]. A pool of 5000 count queries is generated according to the method described in Appendix 10.9 in [7]. Specifically, we generate random predicates on the non-sensitive attributes, each of which is combined with each of the values in the domain of the sensitive attribute to form a query. We count the tuples satisfying a condition of the form $A_1 = v_1 \wedge \ldots \wedge A_d = v_d \wedge SA = v_s$, where each $A_i$ is a distinct non-sensitive attribute, $SA$ is the sensitive attribute, and the $v_i$ and $v_s$ are values from the domains of $A_i$ and $SA$, respectively. The selectivity of a query is defined as the percentage of tuples that satisfy the conditions in the query. For each selectivity $s$ that is considered, we report the average relative error of the estimated count for all queries that pass the selectivity threshold $s$. In later analysis, we group queries according to their distinct selectivities.

Given queries in the pool, we calculate the average relative error 
between the actual count (from the original dataset) and estimated 
count (from the published dataset) as the metric for utility. As 
discussed earlier, we differentiate between small counts and large 
counts. Specifically, we vary the selectivity (denoted by s, which is 
the ratio of the actual count to the cardinality of dataset) from 0.5% 
up to 5% for large counts. For small counts, we require the actual 
count to be no more than 10 (selectivity less than 0.1%). We eval- 
uate the influence of various £' values, and also the cardinalities of 
dataset on the utility. To assess the efficiency, we record and show 
the running time of our data publishing algorithm. 
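
The utility metric described above can be written down in a few lines of Python. This is a minimal sketch: the function names are illustrative, and skipping queries whose actual count is zero (for which the relative error is undefined) is an assumption made here rather than a stated part of the experimental setup.

```python
def selectivity(actual_count, cardinality):
    """Ratio of the actual count of a query to the cardinality of the dataset."""
    return actual_count / cardinality


def average_relative_error(actual, estimated):
    """Mean of |estimated - actual| / actual over queries with a positive actual count."""
    errors = [abs(e - a) / a for a, e in zip(actual, estimated) if a > 0]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    # Two hypothetical queries on a 100k dataset.
    print(selectivity(500, 100000))
    print(average_relative_error([100, 200], [110, 180]))
```

In the example, both queries have a relative error of 10%, so the average relative error is 0.1, and a count of 500 on a 100k dataset corresponds to a selectivity of 0.5%.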

9.2 Utility for large counts 

First we examine the impact of varying $\ell'$, with a separate plot for each distinct selectivity. In particular, the average relative error is computed for $\ell'$ ranging from 2 to 10, as shown in Figure 4, where the selectivity of large counts is concerned. For large selectivity (i.e., large counts) between 2% and 5%, the error is as low as 20%. The error is also bounded by 40% for other selectivities, which is acceptable. Another observation is that, as $\ell'$ increases, the error for most selectivities first decreases but soon starts to rise. This can be explained by the fact that a more restrictive privacy requirement (larger $\ell'$) may compromise the utility. For the special case where the query involves only the sensitive attribute, the relative errors of both small and large counts are shown in Figure 6. The results agree with our analysis in Section 6. The relative error is also shown against the selectivity in Figure 7.

Figure 4: Relative error

Figure 5: Relative error for small counts

9.3 Error for small counts 

We plot the error of queries with small counts separately in Figure 5, where the counts are smaller than 10. As one can observe, the error is sufficiently high to ensure privacy, consistent with our requirement that the answer for a small count should be inaccurate enough to prevent privacy leakage. The relative error also displays a positive linear correlation with $\ell'$. In other words, as $\ell'$ becomes bigger (higher privacy), privacy for small counts is also ensured at a higher level.

9.4 Comparison with other models 

To our knowledge there is no known mechanism for $\ell'$-diverted privacy. We would like to compare the utilities of our method with other models, although it is not a fair comparison since our method provides guarantees not supported by the other models. We have chosen to compare with Anatomy because we have used a similar partitioning mechanism, and Anatomy is an improvement over previous $\ell$-diversity methods since it does not distort the non-sensitive values. We compare with the distortion based differential privacy method since it has been the most widely used technique in differential privacy. Finally we compare with the global randomization mechanism A described in Section 3. We shall see that our method compares favorably with the other methods in terms of utility while addressing the dilemma of utility versus privacy.

Figure 7: Relative error versus selectivity for SA querying

To compare with the Anatomy method, we set both $\ell'$ in $\ell'$-diverted privacy and $\ell$ in Anatomy to the same value, $\ell' = \ell = 5$. The answers for Anatomy are estimated using the method in [34]. We then choose different $s$ and $N$ (sizes of dataset) to evaluate their performance. The average relative errors for Anatomy and mechanism A' are shown in Figures 8 and 9, respectively. The overall error of our method appears smaller than that of Anatomy for most choices of $N$ and $s$. The error is bounded by 30% for mechanism A' and can be over 40% for Anatomy. We can also get an idea of the influence of different dataset cardinalities on the error. In fact, the error does not show an obvious correlation with $N$.

Typical differential privacy secures privacy by adding noise to the answers. Given a set of queries $q_1, \ldots, q_m$, $\varepsilon$-differential privacy can be achieved by a randomization function with a noise distribution of $\mathrm{Lap}(m/\varepsilon)$ [13]. Since $m$ is the maximum number
of queries that can be submitted to D', we first set m to be 100, 
and we choose the $\varepsilon$ parameter in the Laplacian noise to be 0.01 and 0.05, which are normal choices found in the literature. The 100k dataset is used, and the average relative error is shown for $s$ between 1% and 5% in Figure 10. The error from differential privacy, no matter which $\varepsilon$ is chosen, becomes unacceptably large for smaller $s$. On the other hand, the impact of $s$ is limited in the case of our method, the result of which is labeled "$\ell'$-diverted" in the graph. To see how $m$, the number of queries raised, affects the utility, we plot the relative error against $m$ ranging from 10 to 100 in Figure 11. The relative error from our method does not depend on $m$, while that from differential privacy grows linearly with $m$ and becomes very large for large $m$.

The results for the global randomization Mechanism A are shown in Figure 10. We set the value of $p$ to $1/\ell'$ so that the probability of retaining the original sensitive value in each tuple is the same in both methods. It can be seen that our method has much better utility for all the selectivities in our experiment.

9.5 Multiple sensitive values 

We also consider the utility in scenarios where a query involves more than one sensitive value. To this end, we choose Age and Occupation as the sensitive attributes. The two sensitive attributes are randomized independently and then combined for data publication. To allow queries of large selectivities, we first generalize the domain of Age into ten intervals; without this step, most of the resulting counts are too small and the range of selectivities is limited. The relative error for multiple-dimension aggregates involving two sensitive attributes is shown in Figure 12, where $\ell'$ ranges from 2 to 8. Even with the diminished selectivities (0.1% to 0.7% in this case), the overall accuracy can match that in the single-sensitive-attribute scenario.

Figure 10: Comparison of our method ($\ell'$-diverted) with differential privacy and global randomization by mechanism A

Figure 11: Multiple queries in differential privacy

Figure 12: Relative error for 2 sensitive attributes

9.6 Computational overhead 

The computational overhead mainly comes from the partitioning process. We have adopted the partitioning method of Anatomy. This algorithm can be implemented with a time complexity of $O(N(1 + \frac{V}{\ell'}))$, where $N$ is the cardinality of the table, and $V$ is the number of distinct values of the sensitive attribute. We show the running time for the case of a single sensitive attribute on the largest 500K dataset, varying $\ell'$ from 2 to 10. For all chosen $\ell'$ values, our algorithm finishes within 10 seconds on the 500K dataset, which makes it practical to deploy in real applications.

We also consider the querying efficiency at the user side. To estimate the answer, a user computes each component of the vector $y$, and performs matrix multiplications to iteratively converge to the answer $x$. When each component of $x$ changes by no more than 1%, we terminate the iteration and measure the querying time and number of iterations. In our experiments, SQLite (see http://docs.python.org/library/sqlite3.html) serves for querying $y$, and we consider the case with two sensitive attributes, which involves the largest number of components in $y$ and hence the largest computational cost. The result shows that the Bayesian iterative process takes negligible time, while the major cost comes from the querying step. In particular, it takes less than 1 ms on average, and 10 ms in the worst case, for the iterative process to converge. The median and average of the number of iterations are 16 and 325, respectively. In total, the average measured time for a query is 1612 ms, which poses little computational burden on users.



10. RELATED WORK 

Differential privacy has been a breakthrough in the study of privacy preserving information releases. $\varepsilon$-differential privacy has been introduced for query answering, and the common technique is based on distortion of the query answer by random noise that is i.i.d. from a Laplace distribution and calibrated to the sensitivity of the querying [13, 12]. Laplace noise has been used in many related works on differential privacy, including recent works on reducing relative error [33] and the publication of data cubes in [10]. Since the data release can be for different purposes, in some tasks the addition of noise makes no sense. For example, a utilization function might map databases to strings, strategies, or trees. The problem of optimizing the output of such a function while preserving $\varepsilon$-differential privacy is addressed in [26]. For database publication, [6] shows that given a large enough dataset, a synthetic database can be generated that is approximately correct for all concepts in a given concept class; the minimal data size depends on the quality of the approximation, the log of the size of the universe, the privacy parameter $\varepsilon$ and the Vapnik-Chervonenkis dimension of the concept class. Further results can be found in [17]. In most previous works, the definition of error is an absolute error [11] [16] [6] [17]. The algorithm iReduct in [33] considers relative errors and injects noise into query results according to the values of the results. A recent work [23] points out that differential privacy may not guarantee privacy when deterministic statistics have been previously published. In contrast, we consider a more basic possible privacy leak, which is due to the fact that differential privacy does not aim to protect information that can be derived from the published data, deeming such a task impossible. All previous works on differential privacy consider $\varepsilon$-differential privacy for non-zero $\varepsilon$ values. None of the above works considers the guarantee of protection of small sums, which is a major objective in our mechanism.

In the literature of statistical databases, the protection of small counts has been well studied in the topic of security in statistical databases [1]. A concept similar to ours is found in [30], where the aim is to ensure that the error in queries involving a large number of tuples will be significantly less than the perturbation of individual tuples. It has been pointed out in previous works [20, 21] that the security of a database is endangered by allowing answers to counting queries that involve small counts, i.e., where the number of tuples involved in the query is small. In [9], random sampling has been used to ensure large errors for small query set sizes. However, these previous works are about the secure disclosure of statistics from a dataset and do not deal with the problem of sanitization of a dataset for publication, and they have not considered the guarantee of differential privacy. Discriminative privacy protection has been considered in some previous work in privacy preserving data publication such as [35, 37]; however, such works are based on personalized privacy requirements. There have been studies showing that the utility of a published dataset can lead to privacy breach [22] [32]; however, they focus on partition-based methods for $\ell$-diversity, and they point out the problems without proposing a solution.

Randomization techniques have been used in previous works on privacy preservation. The usefulness of such a technique is shown in [3], where the published data is used to build a decision tree which achieves classification accuracy comparable to the accuracy of classifiers built with the original data. An effective reconstruction method for data perturbation is introduced in [2]. In [4], random perturbation is adopted for privacy preserving computation of multidimensional aggregates in data horizontally partitioned at multiple clients. Randomization of transaction datasets for the mining of association rules has been considered in [19].



11. CONCLUSION 

We have introduced a new mechanism for the problem of privacy preserving data publication with the following properties. Firstly, it satisfies $\ell'$-diverted zero-differential privacy, which ensures that the resulting data analysis will show no difference whether an individual keeps its true sensitive value or swaps the true value with other individuals. Secondly, the randomization process makes use of the law of large numbers in ensuring that large counts, which are not as sensitive, can be estimated with high accuracy while small counts are hidden by relatively large errors. Our method is parameter free except for the value of $\ell'$; however, the choice of $\ell'$ has little effect on the privacy and, as shown in our experiments, setting $\ell'$ to 5 or above does well in terms of the utilities. Furthermore, the sensitive value of a tuple in the published data can be any value in the attribute domain, so the mechanism is resilient to auxiliary knowledge which eliminates possible values. Our empirical studies on a real dataset show superior utility performance compared to other state-of-the-art methods which do not have the above guarantees. For future work, we may consider how to handle skewed sensitive microdata [36]. Another direction for future work is to consider mechanisms such as small domain randomization for further boosting the utilities for large counts [7]. The consideration of sequential data releases is another open problem.

As a final remark, all existing privacy models inherently release information that can be derived from the published datasets, and the same is true of our approach. It is important to make known to the users what kind of information they should expect to be released or derivable. In our case, it is relatively accurate answers to queries with large sums.